IME672A PROJECT : PREDICTION OF EMPLOYEE'S SALARY BASED ON US CENSUS DATA

Installing Libraries

Importing Libraries

Import the Dataset

Data Description

 Numeric attributes: age, capitalgain, capitalloss and hoursperweek (4)

 Categorical/Nominal attributes: nativecountry, maritalstatus, relationship, occupation and workclass (5)

 Symmetric binary attributes: race and sex (2)

 Ordinal attribute: education (1)


DATA PREPROCESSING

Data cleaning

There are no null values in our dataset, but we see some "?" values in several columns.

Finding the special characters ("?") in the data frame
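One way to locate the "?" placeholders is shown below. This is a sketch on a hypothetical toy frame (the real notebook's df has the full census columns); pandas reads "?" as an ordinary string, so isnull() does not catch it.

```python
import pandas as pd

# Hypothetical mini-frame standing in for the census data.
df = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov"],
    "occupation": ["Sales", "Craft-repair", "?"],
    "age": [39, 50, 38],
})

# Count "?" placeholders per column.
question_marks = (df == "?").sum()
print(question_marks)
```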

HANDLING IMBALANCED DATA

Since the dataset is imbalanced, we will balance it using the SMOTE (Synthetic Minority Over-sampling Technique) method.
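In practice this is usually done with imbalanced-learn's SMOTE (fit_resample on X, y). As a dependency-free illustration of the core idea, the sketch below synthesizes new minority samples by interpolating between a minority sample and one of its k nearest minority neighbours; all data here is hypothetical.

```python
import numpy as np

def smote_oversample(X_minority, n_new, k=5, rng=None):
    """Minimal sketch of SMOTE's core idea: create synthetic minority
    samples along the line segments joining nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    n = len(X_minority)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)
        # distances from sample i to every minority sample
        d = np.linalg.norm(X_minority - X_minority[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]  # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()                   # interpolation factor in [0, 1)
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

# Toy minority class with 4 samples; generate 3 synthetic ones.
X_min = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.1]])
X_new = smote_oversample(X_min, n_new=3, k=2, rng=0)
print(X_new.shape)
```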

Handling Outliers

    age (z_score or Standard deviation)

    capital gain (average of different groups)

    capital loss (average of different groups)

    hours per week (z_score or standard deviation)
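The z-score approach listed above for age and hours per week can be sketched as follows; the ages are hypothetical, and a |z| threshold of 2 is used here (3 is also common).

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one extreme value.
age = pd.Series([25, 32, 41, 38, 29, 95, 33, 27])

# Standardize, then flag values far from the mean.
z = (age - age.mean()) / age.std()
outliers = age[np.abs(z) > 2]
print(outliers)
```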

EXPLORATORY DATA ANALYSIS

Correlation between columns

There is no high correlation between any two numeric features.

Observations from Histogram:

1. We can group the age column into bins.

2. For Capital Gain and Capital Loss, the data is highly skewed, which needs to be tackled.

3. The hours-per-week column can also be split into bins.

Salary

Since our dataset is balanced, there are approximately equal numbers of employees earning less than 50k dollars and more than 50k dollars.

Sex

There are more male employees than female employees.

import plotly.express as px
px.histogram(df, x='sex', title='sex vs. over50k', color='over50k')

There are more males with a salary above 50k dollars.

Chi-square test for independence between sex and salary
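The chi-square tests used throughout this analysis can all be run with scipy's chi2_contingency on a contingency table of the two variables. The counts below are hypothetical, just to show the call; a p-value below 0.05 would reject independence.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical contingency counts of sex vs. salary class.
table = pd.DataFrame(
    {"<=50K": [1200, 900], ">50K": [600, 150]},
    index=["Male", "Female"],
)

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.4f}, dof={dof}")
```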

Workclass

72 percent of employees belong to the Private workclass category.

There are more employees who belong to the Private sector and earn less than 50k dollars.

Chi-square test for independence between Workclass and Salary

Race

87 percent of all employees are of the White race.

The dataset contains mostly employees of the White and Black races, while all other races are very few in count, so we will combine all other race data into one class labeled "Others".
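Combining the rare categories into an "Others" class can be done with pandas, e.g. with Series.where; the values below are a hypothetical sample of the race column.

```python
import pandas as pd

# Hypothetical race column; races other than White and Black are rare.
race = pd.Series(["White", "Black", "White", "Asian-Pac-Islander",
                  "White", "Amer-Indian-Eskimo", "Black"])

# Keep White and Black; everything else becomes "Others".
race_combined = race.where(race.isin(["White", "Black"]), "Others")
print(race_combined.value_counts())
```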

Chi square test of independence between race and salary

Age

Violin Plot for Age

The violin plot shows that most employees who earn "less than 50k" dollars are around age 25, while those who earn "more than 50k" dollars are around age 45.

Creating Buckets for Age

There are more adults in our dataset, and a negligible number of old people.
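Bucketing a numeric column like age is typically done with pd.cut; the bin edges and labels below are illustrative assumptions, not necessarily the notebook's exact bins.

```python
import pandas as pd

age = pd.Series([19, 25, 37, 52, 68, 45, 81])

# Hypothetical bucket edges and labels.
buckets = pd.cut(age, bins=[0, 25, 45, 65, 100],
                 labels=["Young", "Adult", "Middle-aged", "Old"])
print(buckets.value_counts())
```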

Chi-square for independence between age and salary

Hours Per Week

Creating Buckets for Hours per week

61 percent of employees belong to the "Less Hours" class.

People who work "Less Hours" tend to have a salary of more than 50k dollars.

Chi-square test for independence between hoursperweek and salary

Marital Status

Relationship

The features "relationship" and "maritalstatus" have no missing values. There is some overlap between the two: for example, if a person is a husband or wife, their marital status would be Married. However, as the overlap is not complete, we will keep both columns.

Occupation

The distribution of income varies across the various occupations. The categories are already uniquely identifiable, so we will keep them as they are.

Chi-square for independence between occupation and salary

Education

We will combine school students from preschool to 12th grade into one class labeled "no college/university".

Chi-square for independence between education and salary

Native Country

92 percent of employees come from the United States.

Since the majority of employees are from the United States, we can make "United-States" one class and label the remaining countries as an "Others" class.

Chi-square for independence between nativecountry and salary

Capital Gain and Capital Loss

Approximately 86 percent of employees have a Capital Gain of 0 dollars.

93 percent of employees have a Capital Loss of 0 dollars.

Capital Difference = Capital Gain + Capital Loss

95 percent of employees belong to the "Minor" class of the "Capital Difference" feature.

In the Minor category, there are more employees with a salary greater than 50k dollars.

Chi-square for independence between Capital_difference and salary

Assigning the Salary column values "less than 50k" to 0 and "more than 50k" to 1
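The binary target encoding can be done with a simple map; the exact label strings are assumed here to match the two classes named above.

```python
import pandas as pd

# Hypothetical salary labels as described in the text.
salary = pd.Series(["less than 50k", "more than 50k", "less than 50k"])

target = salary.map({"less than 50k": 0, "more than 50k": 1})
print(target.tolist())  # [0, 1, 0]
```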

Machine Learning Model

Training, Validation and Test Sets

While building real-world machine learning models, it is quite common to split the dataset into three parts: a training set, a validation set and a test set.

We will divide the dataset so that 60% of the data goes to the training set, 20% to the validation set and 20% to the test set. After holding out the 20% test set, this corresponds to a 75%-25% training-validation split of the remaining data.
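The 60/20/20 split above can be obtained with two calls to sklearn's train_test_split, on hypothetical placeholder arrays here: first hold out 20% as the test set, then split the remaining 80% with test_size=0.25.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the encoded census features.
X = np.arange(100).reshape(-1, 1)
y = np.arange(100) % 2

# 20% test, then 25% of the remainder as validation -> 60/20/20 overall.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)
print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```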

Input and Target Columns

Encoding Categorical Data

We need to convert categorical data to numeric (binary) form. A common technique is to use one-hot encoding for categorical columns.

One hot encoding involves adding a new binary (0/1) column for each unique category of a categorical column.
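With pandas this is a one-liner via get_dummies; the frame below is a hypothetical two-column sample.

```python
import pandas as pd

df = pd.DataFrame({"workclass": ["Private", "State-gov", "Private"],
                   "age": [39, 50, 38]})

# One binary column per unique category of "workclass".
encoded = pd.get_dummies(df, columns=["workclass"])
print(encoded.columns.tolist())
```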

Training a Logistic Regression Model

Logistic regression is a commonly used technique for solving binary classification problems. In a logistic regression model, the inputs are combined linearly using weights and a bias, and the result is passed through the sigmoid function to produce a probability between 0 and 1.
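A minimal training sketch with sklearn's LogisticRegression, on synthetic stand-in data (the real notebook trains on the encoded census features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one informative numeric feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression(solver="liblinear")
model.fit(X, y)
print(round(model.score(X, y), 2))
```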

Making Predictions and Evaluating the Model

The model achieves an accuracy of 82% on the training set. We can visualize the breakdown of correctly and incorrectly classified inputs using a confusion matrix.

The accuracy of the model on both the validation and test sets is above 82%, which suggests that our model generalizes well to unseen data.

Decision Tree

Training and Visualizing Decision Trees

A decision tree in general parlance represents a hierarchical series of binary decisions:

A decision tree in machine learning works in exactly the same way, except that we let the computer figure out the optimal structure and hierarchy of decisions instead of coming up with the criteria manually.

The training set accuracy is close to 88%, but we can't rely solely on the training set accuracy; we must evaluate the model on the validation set too.

We can make predictions and compute accuracy in one step using model.score.
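For example, on synthetic stand-in data (score runs predict and computes accuracy internally):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data in place of the encoded census features.
X, y = make_classification(n_samples=500, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42)

tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X_train, y_train)
# .score predicts on X_val and returns the accuracy in one step.
print(round(tree.score(X_val, y_val), 2))
```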

Visualization

Note the gini value in each box. This is the loss function used by the decision tree to decide which features should be used for splitting the data, and at what point the features should be split. A lower Gini index indicates a better split. A perfect split (only one class on each side) has a Gini index of 0.

Feature Importance

Based on the gini index computations, a decision tree assigns an "importance" value to each feature. These values can be used to interpret the results given by a decision tree.
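These importance values are exposed as feature_importances_ in sklearn; a sketch on synthetic data with generic feature names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data with 4 features, 2 of them informative.
X, y = make_classification(n_samples=300, n_features=4,
                           n_informative=2, random_state=0)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

importance = pd.Series(tree.feature_importances_,
                       index=[f"f{i}" for i in range(4)])
print(importance.sort_values(ascending=False))
```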

Gaussian Naive Bayes

Train-Test Split
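A minimal Gaussian Naive Bayes sketch with sklearn, again on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical data in place of the encoded census features.
X, y = make_classification(n_samples=400, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

nb = GaussianNB()
nb.fit(X_train, y_train)
print(round(nb.score(X_test, y_test), 2))
```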